Appendix 1 A Spectral Analysis and LTI-SDE
The chain structure is also convenient for handling streaming data, as we will explain later. We first give a brief introduction to the EP and CEP framework. Step 1. We compute the cavity distribution by removing the current approximation factor from $q$. Step 2. We construct a tilted distribution that combines the cavity distribution with the true likelihood. Step 3. We project the tilted distribution back to the exponential family, $\hat{q} = \operatorname{argmin}_q \mathrm{KL}(\tilde{p} \,\|\, q)$, where $q$ belongs to the exponential family. Step 4. We update the approximation terms in parallel, using damping to avoid divergence. The above computations are straightforward to implement.
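As a concrete illustration of the last two steps (a sketch under assumptions, not code from the paper), the KL projection of Step 3 reduces to moment matching when $q$ is Gaussian, and the damped update of Step 4 is a convex combination in natural-parameter space. The helper names `moment_match` and `damped_update` are hypothetical:

```python
import numpy as np

def moment_match(tilted_logpdf, grid):
    """KL projection onto a Gaussian: minimizing KL(p_tilde || q) over the
    exponential family amounts to matching the first two moments, here
    computed by simple quadrature on a grid."""
    lp = tilted_logpdf(grid)
    w = np.exp(lp - lp.max())   # unnormalized tilted density
    w /= w.sum()
    mean = np.sum(w * grid)
    var = np.sum(w * (grid - mean) ** 2)
    return mean, var

def damped_update(eta_old, eta_new, alpha=0.5):
    """Damped update of a site's natural parameters (Step 4): a convex
    combination of the old and newly projected values to avoid divergence."""
    return (1.0 - alpha) * eta_old + alpha * eta_new
```

For example, projecting a standard-normal tilted density recovers mean 0 and variance 1, and a damping factor of 0.5 moves the site parameters halfway toward their new values.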
- Asia > China > Beijing > Beijing (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- North America > United States > Utah (0.05)
- Africa > Senegal > Kolda Region > Kolda (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (3 more...)
- North America > United States (0.14)
- North America > Dominican Republic (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- (3 more...)
- Research Report (0.68)
- Overview (0.46)
- Summary/Review (0.46)
- Oceania > Australia > New South Wales > Sydney (0.14)
- Europe > Poland > Greater Poland Province > Poznań (0.05)
- North America > Canada > Quebec > Montreal (0.05)
- (16 more...)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- Energy (0.46)
- Government > Regional Government (0.46)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.14)
- Europe > Romania (0.04)
- Europe > United Kingdom > England (0.04)
- (19 more...)
- Health & Medicine > Therapeutic Area > Endocrinology (1.00)
- Education (1.00)
- Banking & Finance (0.92)
- (3 more...)
Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
Meterez, Alexandru, Nair, Pranav Ajit, Morwani, Depen, Pehlevan, Cengiz, Kakade, Sham
Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
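A minimal sketch of the recipe this abstract describes — a horizon-free $1/\sqrt{t}$ step size combined with online uniform weight averaging. The function and class names, the base learning rate, and the warmup length are illustrative, not taken from the paper:

```python
import numpy as np

def lr_inv_sqrt(t, base_lr=1e-3, warmup=100):
    """Horizon-free 1/sqrt(t) schedule: unlike cosine decay, it needs no
    total-step count. base_lr and warmup are illustrative constants."""
    if t < warmup:
        return base_lr * (t + 1) / warmup       # linear warmup
    return base_lr * np.sqrt(warmup / (t + 1))  # polynomial decay

class RunningAverage:
    """Uniform (Polyak) average of the weights, maintained online; the
    averaged weights are what you evaluate at any stopping time."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, w):
        self.n += 1
        w = np.asarray(w, dtype=float)
        if self.avg is None:
            self.avg = w.copy()
        else:
            self.avg += (w - self.avg) / self.n
```

The point of the anytime property is that training can stop at any step: the running average is always a valid model, with no schedule to "finish".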
Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization
Liao, Junyi, Zhu, Zihan, Fang, Ethan, Yang, Zhuoran, Tarokh, Vahid
Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States > Pennsylvania (0.04)
- (3 more...)
- Leisure & Entertainment > Games (1.00)
- Transportation (0.92)
- Information Technology (0.67)
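To make the forward model concrete: the QRE of an entropy-regularized zero-sum matrix game can be computed by a damped softmax fixed-point iteration. This is a generic sketch of the equilibrium concept, not the paper's reward-recovery algorithm; the function name, damping constant, and iteration count are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def qre_zero_sum(A, tau=1.0, iters=2000, damp=0.5):
    """Damped fixed-point iteration for the quantal response equilibrium of
    a zero-sum matrix game with entropy regularization tau. The row player
    maximizes x^T A y, the column player minimizes it; at a QRE each
    strategy is a softmax best response to the other."""
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    for _ in range(iters):
        x_new = softmax(A @ y / tau)
        y_new = softmax(-A.T @ x / tau)
        x = (1 - damp) * x + damp * x_new
        y = (1 - damp) * y + damp * y_new
    return x, y
```

For matching pennies the QRE is the uniform mixed strategy at any temperature; the damped iteration typically behaves well for moderate regularization, though convergence is not guaranteed in general.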
More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization
Ichikawa, Yuma, Fujisawa, Yoshihiko, Fujimoto, Yudai, Sakai, Akira, Fujisawa, Katsuki
For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
- North America > United States (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
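The binary formats discussed above combine a 1-bit sign carrier with magnitude envelopes. A minimal sketch of that shared building block — per-row sign quantization with a least-squares scale, not the full DBF or MDBF factorization — might look like:

```python
import numpy as np

def sign_scale_quantize(W):
    """Per-row 1-bit quantization: W ≈ diag(s) @ sign(W). For a fixed sign
    pattern, the Frobenius-optimal per-row scale (the "envelope") is the
    row's mean absolute value."""
    S = np.sign(W)
    S[S == 0] = 1.0                              # break ties toward +1
    s = np.abs(W).mean(axis=1, keepdims=True)    # optimal envelope per row
    return s * S, s, S
```

The least-squares-optimal scale follows from minimizing $\sum_i (|w_i| - s)^2$ per row; MDBF's idea, per the abstract, is to spend the extra bit budget on a richer rank-$l$ envelope while keeping the binary sign bases shared.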